ENH: Add ORC reader #29447

kkraus14 · 2019-11-06T21:54:14Z

closes DISCUSS: What would an ORC reader/writer API look like? #25229
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Added an ORC reader following the read_parquet API. Still need to give some additional love to the docstrings but this is at least ready for some discussion and eyes on it.
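For readers skimming the thread, a minimal usage sketch of a reader that follows the `read_parquet` calling convention. It assumes a pandas version that includes `read_orc` and a pyarrow install with ORC support; the helper name `load_orc_columns` and any path passed to it are illustrative, not part of this PR.

```python
def load_orc_columns(path, columns=None):
    """Read an ORC file into a DataFrame, optionally selecting columns."""
    # Imported lazily so that defining this helper does not require pandas.
    import pandas as pd

    # Same shape of call as pd.read_parquet(path, columns=columns).
    return pd.read_orc(path, columns=columns)
```

Passing `columns=None` reads every column; a list of names restricts the read to those columns.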

kkraus14 · 2019-11-06T21:54:57Z

Note: could also likely use some work to share the functionality between parquet and ORC readers where possible. There's a lot of copy paste currently in this PR.

kkraus14 · 2019-11-12T15:48:24Z

Unfortunately, I likely won't have the bandwidth to drive this to completion for a while, so if someone would like to take this over, that would be welcome.

jreback · 2019-11-17T13:31:57Z

ok let's see what we can do here

WillAyd

lgtm

WillAyd · 2019-11-18T17:13:50Z

pandas/io/orc.py

+    DataFrame
+    """
+
+    impl = get_engine(engine)


Just out of curiosity what other readers do you see implementing here? I assume a lot of this extra code is for config management to return the PyArrowImpl

nothing, can prob simplify this

pandas/io/orc.py

WillAyd

Some minor things not blockers for me. Looks good overall

WillAyd · 2019-11-29T22:43:51Z

pandas/io/orc.py

+    DataFrame
+    """
+
+    impl = get_engine(engine)


Not sure if you still want to do this but commenting for context of next few comments

pandas/tests/io/test_orc.py

jorisvandenbossche

Added a few comments.

For the test files, did you write them yourself, or do they come from somewhere? I was wondering if you could make the > 50KB ones a bit smaller.

jorisvandenbossche · 2019-12-02T12:20:41Z

doc/source/user_guide/io.rst

+for data frames. It is designed to make reading data frames efficient. Pandas provides *only* a reader for the
+ORC format, :func:`~pandas.read_orc`.
+
+See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ for more.


There is actually no documentation about ORC there, so that link is not very helpful ...

Can you mention this reader needs pyarrow to be installed instead?

see 2 lines above about the ORC format

docs are a bit sparse on the pyarrow side.

created: https://issues.apache.org/jira/browse/ARROW-7296

i was just looking through this PR to see who to ask about pyarrow.orc docs.

iirc @jorisvandenbossche did a PR in arrow side for this

pandas/core/config_init.py

jorisvandenbossche · 2019-12-02T12:23:10Z

pandas/io/orc.py

+    DataFrame
+    """
+
+    impl = get_engine(engine)


Yes, I would personally simplify this a lot by removing the class.
If at some point we actually want to have different engines, we can always use the approach of the parquet code then.
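For context, the simplification suggested here might look roughly like the following: one function that imports pyarrow lazily, with no engine class and no `get_engine` dispatch. This is a minimal sketch, not the code in this PR; `import_optional` and `read_orc_sketch` are hypothetical names, and it assumes a pyarrow build that ships `pyarrow.orc`.

```python
import importlib


def import_optional(name):
    """Import an optional dependency, raising a clearer error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(f"{name} is required for ORC support") from err


def read_orc_sketch(path, columns=None):
    """Read an ORC file into a DataFrame without any engine abstraction."""
    # The dependency is only imported when the reader is actually called.
    orc = import_optional("pyarrow.orc")
    orc_file = orc.ORCFile(path)
    # Read the selected columns (all of them if None) and convert to pandas.
    return orc_file.read(columns=columns).to_pandas()
```

If a second engine were ever needed, the dispatch layer used by the parquet code could be reintroduced around this function.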

jreback · 2019-12-02T23:45:36Z

> Added a few comments.
>
> For the test files, did you write them yourself, or do they come from somewhere? I was wondering if you could make the > 50KB ones a bit smaller.

IIRC @kkraus14 mentioned these were externally generated. They are pretty small, so I don't think this is a big deal, and they are skipped anyhow when we generate the stripped builds (e.g. ones that remove tests/data).

jreback · 2019-12-03T00:01:49Z

updated

jorisvandenbossche

> IIRC @kkraus14 mentioned these were externally generated. They are pretty small, so I don't think this is a big deal, and they are skipped anyhow when we generate the stripped builds (e.g. ones that remove tests/data).

OK!

pandas/io/orc.py

doc/source/user_guide/io.rst

jreback · 2019-12-10T12:28:23Z

this is good to go @jorisvandenbossche @WillAyd

jorisvandenbossche

Looks good to me!

(just 2 minor doc comments that can be applied online)

jorisvandenbossche · 2019-12-10T12:34:42Z

pandas/io/orc.py

+        By file-like object, we refer to objects with a ``read()`` method,
+        such as a file handler (e.g. via builtin ``open`` function)
+        or ``StringIO``.
+    columns : list, default=None


Suggested change

-    columns : list, default=None
+    columns : list, default None

jorisvandenbossche · 2019-12-10T12:37:22Z

pandas/io/orc.py

+    DataFrame
+    """
+
+    # we require a newer version of pyarrow thaN we support for parquet


Suggested change

-    # we require a newer version of pyarrow thaN we support for parquet
+    # we require a newer version of pyarrow than we support for parquet
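The code comment in this suggestion refers to a minimum-version guard. A minimal sketch of such a check, using only the standard library; `parse_version`, `require_min_version`, and any version numbers passed to them are illustrative assumptions, not the values or helpers used in this PR.

```python
import re


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric suffixes such as "dev3"
        parts.append(int(match.group()))
    # Pad to three components so that "0.13" and "0.13.0" compare equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def require_min_version(name, installed, minimum):
    """Raise ImportError if ``installed`` is older than ``minimum``."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(f"{name} must be >= {minimum}, found {installed}")
```

Comparing integer tuples rather than raw strings avoids the pitfall where "0.9.0" sorts after "0.13.0" lexicographically.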

jreback · 2019-12-10T13:14:04Z

@jorisvandenbossche wow I can't commit these on-line (likely because this is owned by @kkraus14 )

WillAyd

lgtm. very nice implementation

jorisvandenbossche · 2019-12-11T08:12:25Z

Thanks @kkraus14 and @jreback !

> wow I can't commit these on-line (likely because this is owned by @kkraus14 )

Strange, as the branch is open to push commits by maintainers ..

jreback · 2019-12-11T11:15:49Z

> lgtm. very nice implementation

actually this was on my office computer; we block push so likely that’s the issue; thanks for merge

kkraus14 · 2019-12-23T19:13:04Z

Apologies that I didn't have the bandwidth to drive this to completion. Thanks @jreback and @jorisvandenbossche for driving this to completion.

jreback · 2019-12-23T20:57:02Z

thanks for putting this up @kkraus14 !

alimcmaster1 added the "IO Data" label (IO issues that don't fit into a more specific label) Nov 6, 2019

jreback force-pushed the orc-reader branch from beb3ba9 to f2bb596 Compare November 17, 2019 13:31

jreback added this to the 1.0 milestone Nov 17, 2019

jreback force-pushed the orc-reader branch from ba68be8 to 2820441 Compare November 17, 2019 23:28

WillAyd reviewed Nov 18, 2019

kkraus14 commented Nov 19, 2019

pandas/io/orc.py (outdated, resolved)

kkraus14 commented Nov 19, 2019

pandas/io/orc.py (outdated, resolved)

WillAyd reviewed Nov 29, 2019

jreback force-pushed the orc-reader branch from fa60ce4 to 5ad5832 Compare December 2, 2019 00:54

jorisvandenbossche requested changes Dec 2, 2019

jreback force-pushed the orc-reader branch from 5ad5832 to 2870904 Compare December 3, 2019 00:01

jorisvandenbossche reviewed Dec 3, 2019

pandas/io/orc.py (outdated, resolved)

doc/source/user_guide/io.rst (outdated, resolved)

jreback force-pushed the orc-reader branch from 2870904 to 10b6e4b Compare December 8, 2019 15:40

kkraus14 and others added 12 commits December 8, 2019 15:12

add orc reader

810cd3c

black

6240f94

flake8 and more black

21ada9f

update docs

39518da

use min version of pyarrow

f4e4eb5

update doc-links & add typing

1fe30e9

simplify

5582a09

skip tests on windows

0d027e9

actually skip on windows

a4284d1

simplify imports

25cf714

clean impl

e8efceb

Revert "clean impl"

bf4f013

This reverts commit 19e4d47.

jreback added 6 commits December 8, 2019 15:12

Revert "simplify imports"

ad1bade

This reverts commit 6919a70.

remove option for multiple backends & simplify tests

ca016ef

small doc update

b846bff

fix doc error & make simpler

ebaec28

actually skip on windows

39b578d

skip on dep missing

8a203a6

jreback force-pushed the orc-reader branch from 5901590 to 8a203a6 Compare December 8, 2019 20:16

jorisvandenbossche approved these changes Dec 10, 2019

WillAyd approved these changes Dec 10, 2019

typo

884d61e

jorisvandenbossche changed the title ~~Add ORC reader~~ ENH: Add ORC reader Dec 11, 2019

jorisvandenbossche merged commit 9b202d3 into pandas-dev:master Dec 11, 2019

jreback added a commit that referenced this pull request Dec 11, 2019

Revert "ENH: Add ORC reader (#29447)"

5f40d7d

This reverts commit 9b202d3.

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Add ORC reader (pandas-dev#29447)

edcc382

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Add ORC reader (pandas-dev#29447)

96dc90f
